February 6, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
Recall that the goal of supervised machine learning is to learn some function that generally does a good job of mapping the input features to an outcome.
Generically, we assume that:
\[y = f(\mathbf x) + \epsilon\]
We would like to find an approximation to this function, \(\hat{f}(\mathbf x)\), that minimizes the generalization error:
\[\hat{f}(\mathbf x) = \underset{f^*}{\text{argmin }} E_\mathcal T[\mathcal L(y - f^*(\mathbf x))]\]
Unfortunately, we don’t get to see this quantity and need to approximate it.
Most commonly, we rely on the empirical risk minimization framework. Given training data:
\[E_\mathcal T[\mathcal L(y - f^*(\mathbf x))] = \expn \mathcal L(y_i - f^*(\mathbf x_i)) + \frac{\lambda}{N} C(f^*(\mathbf x))\]
where \(C()\) is a measure of the complexity of the approximating function.
Instead, let’s restrict ourselves to learning methods that are universal approximators
Universal Approximator:
A learning method that can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error.
Deciphering the nerd speak:
A universal approximator can uncover any mapping of \(\mathbf x\) to \(y\) that is continuous on a bounded subset of \(\mathbb R^P\)
Even fewer syllables:
It can learn any reasonable function
IMPORTANT!
Just because it can, doesn’t mean that it will.
Reason 1:
The optimization method used to try to learn the function can’t actually find the function
A practical reason that a universal approximator won’t find the truth.
IMPORTANT!
Just because it can, doesn’t mean that it will.
Reason 2:
The training algorithm may choose the wrong function due to overfitting
All we get is the training data, so a universal approximator may not uncover the minimum generalization error function
Confusing irreducible error for signal.
Which methods are universal approximators? There are a lot!
KNN
Explicit Polynomial Feature Transformations
Implicit Polynomial Expansions via Infinite Basis Kernels
Regression/Classification Trees
And neural networks with at least one hidden layer
Problems:
Exploding parameter space
Poor generalizability
Inability to extend well to complex data types
Let’s start with a simple example: the XOR problem
\[\mathbf X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix} \]
\[ \mathbf y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} \]
Goal: shatter the space with a classifier!
Starting point: linear classifier.
Does anyone remember the VC dimension of a linear classifier?
There is no way that a linear classifier can shatter this space!
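We can check this empirically. A minimal sketch (assuming scikit-learn is available): no linear boundary separates the four XOR points, so a plain logistic regression can never classify more than three of them correctly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The XOR data from above
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Fit a linear classifier on the raw features
clf = LogisticRegression().fit(X, y)
acc = clf.score(X, y)
print(acc)  # at most 0.75: a linear boundary cannot shatter XOR
```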
Can we get around this with polynomials?
At a 5th degree polynomial, this works!
But, it seems a little silly
All we really need to do here is draw two lines instead of one.
Let’s make things simple (and non-differentiable) by stating an alternative loss function - the Heaviside function:
\[\mathcal L(\bb | \mathbf X, \mathbf y) = \expn I(\theta_i \ge 0)^{y_i} \left(1 - I(\theta_i \ge 0)\right)^{1 - y_i}\]
\[\theta_i = \mathbf z_i^T \bb + b\]
where \(\mathbf z_i\) is the set of features for instance \(i\).
\(\theta_i\) is determined using a linear equation
Let’s consider a set of chained equations
\[h_{i1} = \mathbf w_1^T \mathbf x_i + c_1\] \[h_{i2} = \mathbf w_2^T \mathbf x_i + c_2\]
\[\theta_i = \bb^T \mathbf h_i + b\]
More compactly noted, this chain can be represented as:
\[\theta_i = \bb^T(\mathbf W^T \mathbf x_i + \mathbf c) + b\]
where \(\bb\) is a 2-vector of weights, \(\mathbf W\) is a \(2 \times 2\) matrix of weights, \(\mathbf c\) is a 2-vector of bias terms (think intercepts), and \(b\) is an additional bias term.
This is a multilayer perceptron
We start with the input features
We pass the inputs to a hidden layer
The hidden layer is then passed to the output layer
Let’s suppose that we know \(\mathbf W\) and \(\mathbf c\):
\[\mathbf W = \begin{bmatrix} 1 & 1\\ 1 & 1 \end{bmatrix} \]
\[\mathbf c = \begin{bmatrix} 0\\ -1 \end{bmatrix} \]
Then we can compute the values of \(\mathbf H\) in the hidden layer as:
\[ \mathbf H = \mathbf X \mathbf W + \mathbf c = \begin{bmatrix} 0 & -1\\ 1 & 0\\ 1 & 0\\ 2 & 1\\ \end{bmatrix} \]
\(\mathbf h_i\) is the hidden representation of the original features given the coefficients for the hidden layer!
The goal of a multilayer perceptron is to get around a “bad” natural space where the instances cannot be separated linearly by engineering a new feature space
Typically through affine transformations of the original feature space!
Recall that this is the basis of the kernel trick; just not going from \(P \to \infty\) immediately…
The hope is that the new space admits a linear classifier!
That didn’t work…
The problem is that a linear hidden layer (no matter how big it is) is just a linear model with more steps…
\[\theta_i = \bb^T(\mathbf W^T \mathbf x_i + \mathbf c) + b\]
\[\theta_i = \bb^T\mathbf W^T \mathbf x_i + (\bb^T\mathbf c + b)\]
Taking a long walk for a short drink of water.
The trick: activation functions
Use some function to create a nonlinear mapping of the feature space to the hidden space.
We’ll start with the most common one, called the rectified linear unit - ReLU for short.
Given a scalar input
\[\varphi(x) = \text{max}(0, x)\]
We’re taking \(x\) and putting a kink in it at 0!
We typically apply ReLU elementwise to a vector or matrix
\[ \mathbf H = \mathbf X \mathbf W + \mathbf c = \begin{bmatrix} 0 & -1\\ 1 & 0\\ 1 & 0\\ 2 & 1\\ \end{bmatrix} \]
\[ \mathbf H = \varphi\left(\mathbf X \mathbf W + \mathbf c\right) = \begin{bmatrix} 0 & 0\\ 1 & 0\\ 1 & 0\\ 2 & 1\\ \end{bmatrix} \]
We’ve separated the points in the hidden space!
To finish this specification, we set:
\[\bb^T = [1,-2]\]
and \(b = 0\).
Therefore, the decision boundary will be any point where:
\[\varphi\left(\mathbf X \mathbf W + \mathbf c\right)\bb + b = 0\]
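To verify the hand-set weights, here is a minimal NumPy sketch of the full forward pass (using a strict inequality \(\theta_i > 0\) to assign class 1, so the boundary points fall in class 0):

```python
import numpy as np

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hand-set hidden-layer weights and biases from above
W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])

# Hidden representation: affine map followed by the ReLU kink
H = np.maximum(0, X @ W + c)

# Output layer: beta = [1, -2], b = 0
beta = np.array([1, -2])
theta = H @ beta + 0

# Classify with a strict threshold at zero
pred = (theta > 0).astype(int)
print(H)     # [[0 0] [1 0] [1 0] [2 1]]
print(pred)  # [0 1 1 0] -- XOR solved
```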
We got it right!
How many parameters did it take?
4 weights in the hidden layer
2 bias terms
2 “logistic” weights
1 “logistic” bias term
9 terms (for four inputs…)
The 5th degree polynomial model took 21 parameters!
Why not have three hidden units in the hidden layer?
A flexible specification of a single layer fully connected neural network (FCNN).
Let \(\mathcal L(\boldsymbol \theta | \mathbf x, y) = g(\boldsymbol \theta , \mathbf x, y)\) be a generic loss function with unknowns \(\boldsymbol \theta\).
Let \(\mathbf x\) be an arbitrary input vector of length \(P\).
\[\theta = \boldsymbol \beta^T \phi(\mathbf x; \mathbf W, \mathbf c) + b\]
\(\mathbf W\) is a weight matrix with number of rows equal to \(P\) and number of columns equal to the number of hidden units, \(K\)
\(\mathbf c\) is a \(K\) vector of bias terms
\(\phi()\) is a function applied to the combination of \(\mathbf x\), \(\mathbf W\), and \(\mathbf c\) that induces nonlinearities
\(\boldsymbol \beta\) is a \(K\)-vector of coefficients used to compute the loss.
\(b\) is a bias term for the loss layer.
Most commonly, the single layer FCNN is set up with a linear predictor in the hidden layer:
\[\theta = \boldsymbol \beta^T \phi(\mathbf W^T \mathbf x + \mathbf c) + b\]
This is the most “basic” FCNN
Surprisingly flexible as \(K\) gets larger!
Given a choice of \(K\) and a loss function, we can find the coefficients that minimize the loss!
Let’s start with a two-class classification problem.
A chain of four equations! \[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = - \expn \left[ y_i \log \theta_i + (1 - y_i) \log ( 1 - \theta_i) \right]\]
\[\theta_i = \sigma(\mathbf z_i^T \boldsymbol \beta)\]
\[\mathbf z_i = \varphi(\mathbf h_i)\]
\[\mathbf h_i = \mathbf x_i^T \mathbf W\]
Getting rid of the bias terms for simplicity. In this setup, just add a 1 to the front of the layer input to get the bias term.
Let’s let \(\varphi()\) be the ReLU function:
\[\varphi(x) = \text{max}(0, x)\]
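The chain above can be sketched directly in NumPy. The function names (`forward`, `bce_loss`) and the random example values are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(X, W, beta):
    """Forward pass for the four-equation chain (bias terms omitted, as above)."""
    H = X @ W                  # h_i = x_i^T W : linear map to the hidden space
    Z = np.maximum(0, H)       # z_i = ReLU(h_i)
    theta = sigmoid(Z @ beta)  # theta_i = sigma(z_i^T beta)
    return theta

def bce_loss(theta, y, eps=1e-12):
    """Binary cross-entropy, averaged over instances."""
    theta = np.clip(theta, eps, 1 - eps)
    return -np.mean(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Tiny illustration with arbitrary weights: P = 3 features, K = 4 hidden units
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 4))
beta = rng.normal(size=4)
y = np.array([0, 1, 1, 0, 1])

theta = forward(X, W, beta)
loss = bce_loss(theta, y)
print(loss)
```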
Top layer: Logistic regression over the \(K\)-dimensional hidden space
What is the hidden space?
Think about this as a familiar problem.
\[\underset{N \times K}{\mathbf H} = \underset{N \times P}{\mathbf X} \underset{P \times K}{ \mathbf W}\]
By virtue of including the ReLU nonlinearity, this amounts to a kind of nonlinear PCA
The biggest difference is what we’re optimizing to:
PCA finds an arrangement that optimally reduces redundancies
Hidden spaces for FCNNs find an arrangement that optimally classifies observations!
Unlike PCA pre-processing, FCNNs make the reduced-dimensional hidden space a direct part of the optimization problem!
Our unknowns:
\[\bb \in \mathbb R^{K + 1}\]
\[\mathbf W \in \mathbb R^{(P + 1) \times K}\]
We know that we’re unlikely to be able to find analytical minima here, so we’ll need to rely on the gradient to use SGD methods to optimize this function!
In theory, we could write our own implementation of SGD that finds a minimum point in this loss space!
In practice, we don’t really need to do this.
For simple FCNNs, we can use sklearn.neural_network.MLPClassifier
For larger and more complex FCNNs, we’ll need to use Keras/TensorFlow/PyTorch
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Train a 1-layer neural network with 1 hidden unit
# (X_train and y_train are the training data from the earlier slides)
nn_clf = MLPClassifier(hidden_layer_sizes=(1,), activation='relu', solver='adam', random_state=42)
nn_clf.fit(X_train, y_train)

# Compute the training accuracy
y_train_pred = nn_clf.predict(X_train)
training_accuracy = accuracy_score(y_train, y_train_pred)

# Create a grid of points covering the feature space
xx, yy = np.meshgrid(np.linspace(-3, 3, 500), np.linspace(-3, 3, 500))

# Use the trained model to predict the class at each grid point
Z = nn_clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Create a contour plot of the decision regions
plt.figure()
contour = plt.contourf(xx, yy, Z, cmap='viridis', alpha=0.8)

# Title the plot with the number of hidden units and the training accuracy
plt.title(f'Hidden Units: {nn_clf.hidden_layer_sizes[0]}, Training Accuracy: {training_accuracy:.2f}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```

Each hidden unit for a single layer neural network is a partition in the original feature space
Due to the ReLU activation, one side of the partition maps to 0 on the hidden unit
The other side maps to its linear predictor
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Train a 1-layer neural network with 2 hidden units
nn_clf = MLPClassifier(hidden_layer_sizes=(2,), activation='relu', solver='adam', random_state=42)
nn_clf.fit(X_train, y_train)

# Compute the training accuracy
y_train_pred = nn_clf.predict(X_train)
training_accuracy = accuracy_score(y_train, y_train_pred)

# Create a grid of points covering the feature space
xx, yy = np.meshgrid(np.linspace(-3, 3, 500), np.linspace(-3, 3, 500))

# Use the trained model to predict the class at each grid point
Z = nn_clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Create a contour plot of the decision regions
plt.figure()
contour = plt.contourf(xx, yy, Z, cmap='viridis', alpha=0.8)

# Title the plot with the number of hidden units and the training accuracy
plt.title(f'Hidden Units: {nn_clf.hidden_layer_sizes[0]}, Training Accuracy: {training_accuracy:.2f}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```

If we let the number of hidden units go to \(\infty\), we’ll recover the nasty classifier!
A single layer NN with enough hidden units is a universal approximator
Why did I stop at 1000?
The NN compute time scales in the number of hidden units rather than \(P\) (more or less)
Any intuition?
The question is how does a single layer neural network interpolate arbitrary functions?
A chain of three equations! \[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn (y_i - \mathbf z_i^T \bb)^2\]
\[\mathbf z_i = \varphi(\mathbf h_i)\]
\[\mathbf h_i = \mathbf x_i^T \mathbf W\]
What we end up with is a set of local linear approximations
Within each region defined by the ReLU functions over the hidden space (an area where a fixed subset of the ReLUs is active), the fitted function has a constant slope
With more and more units, we get more and more linear approximations
At a certain granularity, what is the difference between lines and curves? Another victory for Taylor!
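A quick illustration of the piecewise-linear idea, using `sklearn.neural_network.MLPRegressor` (an assumption; the slides only use the classifier): a wide single hidden layer of ReLU units fits a smooth curve out of many small linear pieces.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# A smooth target function to approximate
x = np.linspace(-3, 3, 400).reshape(-1, 1)
y = np.sin(2 * x).ravel()

# A wide single hidden layer of ReLU units: each unit contributes one kink,
# so the fit is piecewise linear with up to K pieces
nn = MLPRegressor(hidden_layer_sizes=(200,), activation='relu',
                  solver='adam', max_iter=5000, random_state=42)
nn.fit(x, y)

# Training error of the piecewise-linear approximation
mse = np.mean((nn.predict(x) - y) ** 2)
print(mse)
```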
We can uncover any arbitrary function using a single layer NN with \(K\) arbitrarily large!
This gets computationally unruly, though
We’re, more or less, computing \(K\) \(P\)-dimensional linear regressions
\(K = 10000\) is a lot of linear regressions!
There’s a reasonable improvement - add more layers of hidden units.
\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = - \expn \left[ y_i \log \theta_i + (1 - y_i) \log ( 1 - \theta_i) \right]\]
\[\theta_i = \sigma(\mathbf z_i^T \boldsymbol \beta)\]
\[\mathbf z_{i} = \varphi(\mathbf h_{i})\]
\[\mathbf h_{i} = \mathbf q_i^T \mathbf W_2\]
\[\mathbf q_i = \varphi(\mathbf g_i)\]
\[\mathbf g_i = \mathbf x_i^T \mathbf W_1\]
A two layer, fully connected neural network (FCNN) can be succinctly defined in two equations:
\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn \mathcal L(\theta_i | \mathbf x_i , y_i)\]
\[\theta_i = g(\bb^T \varphi(\mathbf W_2^T \varphi(\mathbf W_1^T \mathbf x_i + \mathbf c_1) + \mathbf c_2) + b)\]
\(K_1\) hidden units in the first layer
\(K_2\) hidden units in the second layer
Two sets of hidden weight matrices: \(\mathbf W_1\) (\(P \times K_1\)) and \(\mathbf W_2\) (\(K_1 \times K_2\))
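A minimal NumPy sketch of the two-equation forward pass, \(\theta_i = g(\bb^T \varphi(\mathbf W_2^T \varphi(\mathbf W_1^T \mathbf x_i + \mathbf c_1) + \mathbf c_2) + b)\), with \(g\) taken to be the sigmoid; the example values are arbitrary:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward_two_layer(x, W1, c1, W2, c2, beta, b):
    """theta = sigmoid(beta^T relu(W2^T relu(W1^T x + c1) + c2) + b)."""
    h1 = np.maximum(0, W1.T @ x + c1)   # first hidden layer: K1 units
    h2 = np.maximum(0, W2.T @ h1 + c2)  # second hidden layer: K2 units
    return sigmoid(beta @ h2 + b)

# Arbitrary shapes for illustration: P = 3, K1 = 4, K2 = 2
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, c1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, c2 = rng.normal(size=(4, 2)), rng.normal(size=2)
beta, b = rng.normal(size=2), 0.0

theta_out = forward_two_layer(x, W1, c1, W2, c2, beta, b)
print(theta_out)  # a probability in (0, 1)
```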
Why depth rather than width?
Let’s show some benefits via simulation:
Same training data
Two layer neural network with \(K_1 = K_2\) hidden units
One layer neural network with \(2 \times K_1\) hidden units
The total number of vectors of coefficients is the same!
```python
import time
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# List of numbers of hidden units
hidden_units = [2, 5, 10] + list(range(25, 101, 25))

# Lists to store the accuracies and training times
accuracies = []
ttime = []

# Two layer networks: h units in each hidden layer
for h in hidden_units:
    start_time = time.time()
    # Create a MLPClassifier object with two hidden layers
    clf = MLPClassifier(hidden_layer_sizes=(h, h), activation='relu', solver='adam', max_iter=1000, random_state=42)
    # Train the MLPClassifier (X_train and y_train from the earlier slides)
    clf.fit(X_train, y_train)
    # Predict the classes of the training set
    y_pred = clf.predict(X_train)
    # Compute and record the accuracy of the classifier
    accuracies.append(accuracy_score(y_train, y_pred))
    ttime.append(time.time() - start_time)

# Lists to store the accuracies and training times for the single layer networks
accuracies_single_layer = []
ttime_single_layer = []

# Single layer networks with twice the hidden units
for h in hidden_units:
    start_time = time.time()
    # Multiply the number of hidden units by two
    g = h * 2
    # Create a MLPClassifier object with a single hidden layer
    clf = MLPClassifier(hidden_layer_sizes=(g,), activation='relu', solver='adam', max_iter=1000, random_state=42)
    # Train the MLPClassifier
    clf.fit(X_train, y_train)
    # Predict the classes of the training set
    y_pred = clf.predict(X_train)
    # Compute and record the accuracy and training time
    ttime_single_layer.append(time.time() - start_time)
    accuracies_single_layer.append(accuracy_score(y_train, y_pred))
```

DNNs with an equivalent number of weight vectors to single layer NNs:
Achieve higher accuracy with fewer hidden units
Faster to compute than the equivalent single layer model
For compute time, any intuition as to what’s going on here?
How are we able to achieve higher accuracy with a DNN?
It really comes down to how it constructs the hidden units!
Unique patterns found: 2760
The top layer receives information and attempts to put the regions into an order that will directly minimize classification error
A 2 layer model breaks the problem into two steps
Find regions in the feature space that are similar regardless of spatial similarity
Classify in the reordered space that has already attempted to put like regions together in a smooth way!
DNNs work so well because they treat the feature extraction problem in a hierarchical way!
For classification, this is how a human would approach the problem
First, try to put like observations with like observations regardless of class
Then, apply classification to groups of instances instead of one at a time
Hierarchical pattern finding
Next class
More on DNNs
Lack of strict convexity and optimization difficulty
Backpropagation